Abstract

In the last few years, many different performance measures have been
introduced to overcome the weakness of the most natural metric, the
Accuracy. Among them, Matthews Correlation Coefficient has recently gained
popularity among researchers not only in machine learning but also in
several application fields such as bioinformatics. Nonetheless, further
novel functions are being proposed in the literature. We show that
Confusion Entropy, a recently introduced classifier performance measure for
multi-class problems, has a strong (monotone) relation with the multi-class
generalization of a classical metric, the Matthews Correlation Coefficient.
Computational evidence in support of the claim is provided, together with
an outline of the theoretical explanation.

1. Introduction

One of the major tasks in machine learning is the comparison of
classifiers' performance. This comparison can be carried out either by
means of statistical tests (Demšar, 2006; García & Herrera, 2008) or using
a performance measure as an indicator to derive similarities and
differences. For binary problems, a number of meaningful metrics are
available and their properties are well understood. On the other hand, the
definition of performance measures in the context of multi-class
classification is still an open research topic, although several functions
have been proposed in the last few years: see (Sokolova & Lapalme, 2009;
Ferri et al., 2009) for two comparative reviews, (Felkin, 2007) for a
discussion of the differences between the use of the same classifier on a
binary and a multi-class task, and (Diri & Albayrak, 2008) for an
alternative graphical comparison approach. As an example, one of the most
important measures for binary classifiers, the Area Under the Curve (AUC)
(Hanley & McNeil, 1982; Bradley, 1997) associated to the Receiver Operating
Characteristic curve, has no automatic extension to the multi-class case.
Although an agreed, reasonable average-based extension exists (presented in
(Hand & Till, 2001)), several alternative formulations are being presented,
either based on a multi-class ROC approximation (Everson & Fieldsend, 2006;
Landgrebe & Duin, 2005, 2006, 2008) or by viewing the ROC as a surface
whose volume (Volume Under the Surface, VUS) has to be computed (by exact
integration or polynomial approximation) as in (Ferri et al., 2003; Van
Calster et al., 2008; Li, 2009). Other measures are more naturally defined,
starting from the accuracy (ACC, i.e. the fraction of correctly predicted
samples) and the similar Global Performance Index (Freitas et al.,
2007a,b), to the Matthews correlation coefficient (MCC). This latter
function was introduced in (Matthews, 1975) and it is also known as the
$\phi$-coefficient, corresponding for a 2 x 2 contingency table to the
square root of the average $\chi^2$ statistic, $\sqrt{\chi^2/n}$. MCC has
recently attracted the attention of the machine learning community (Baldi
et al., 2000) as one of the best methods to summarize into a single value
the confusion matrix of a binary classification task. Its use as one of the
preferred classifier performance measures has increased since then, and for
instance it has been chosen (together with AUC) as the elective metric in
the US FDA-led initiative MAQC-II aimed at reaching consensus on the best
practices for development and validation of predictive models based on
microarray gene expression and genotyping data for personalized medicine
(The MicroArray Quality Control (MAQC) Consortium, 2010). A generalization
to the multi-class case was defined in (Gorodkin, 2004), later used also
for comparing network topologies (Supper et al., 2007; Stokic et al.,
2009). Finally,
another interesting set of measures that have a natural definition for
multi-class confusion matrices consists of the functions derived from the
concept of (information) entropy, first introduced by Shannon in his famous
paper (Shannon, 1948). Many measures have been defined in the
classification framework based on the entropy function, from simpler ones
such as the confusion matrix entropy (van Son, 1994), to more complex
expressions such as the transmitted information (Abramson, 1963) or the
relative classifier information (RCI) (Sindhwani et al., 2001). A novel
multi-class measure belonging to this set has been recently introduced
under the name of Confusion Entropy (CEN) by Wei and colleagues in (Wei et
al., 2010a,b): in this work, the authors compare their measure to RCI and
accuracy, and they prove CEN to be superior in discriminative power and
precision to both alternatives in terms of two statistical indicators
called degree of consistency and degree of discriminancy, defined in (Huang
& Ling, 2005).

In the present work we investigate the similarity between
Confusion Entropy and Matthews correlation coefficient. In 
particular, we experimentally show that the two measures are 
strongly correlated, and their relation is globally monotone and 
locally almost linear. Moreover, we provide a brief outline of 
the mathematical links between CEN and MCC. 



2. Confusion Entropy and Matthews Correlation Coefficient

Given a classification problem on S samples
$\mathcal{S} = \{s_i : 1 \le i \le S\}$ and N classes $\{1,\dots,N\}$,
define the two functions $tc, pc: \mathcal{S} \to \{1,\dots,N\}$ indicating
for each sample s its true class tc(s) and its predicted class pc(s),
respectively. The corresponding confusion matrix is the square matrix
$C \in \mathcal{M}(N \times N, \mathbb{N})$ whose ij-th entry $C_{ij}$ is
the number of elements of true class i that have been assigned to class j
by the classifier:

$$C_{ij} = \left|\{s \in \mathcal{S} : tc(s) = i \text{ and } pc(s) = j\}\right| .$$
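As a minimal illustration of this definition, the confusion matrix can be tallied directly from the two label sequences. The function name `confusion_matrix` and the toy labels below are assumptions made only for this sketch; classes are numbered 1..N as in the text, and rows index the true class.

```python
def confusion_matrix(tc, pc, n_classes):
    """Return C with C[i-1][j-1] = number of samples of true class i predicted as class j."""
    C = [[0] * n_classes for _ in range(n_classes)]
    for true_label, predicted_label in zip(tc, pc):
        C[true_label - 1][predicted_label - 1] += 1
    return C

# Hypothetical toy data: 7 samples, 3 classes.
tc = [1, 1, 2, 2, 3, 3, 3]   # true classes
pc = [1, 2, 2, 2, 3, 1, 3]   # predicted classes
C = confusion_matrix(tc, pc, 3)
for row in C:
    print(row)
```

The diagonal of `C` counts the correctly classified samples, so the sum of the diagonal divided by the number of samples gives the accuracy of this toy classifier.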



The Confusion Entropy measure CEN ranges between 0 (perfect classification)
and 1 for the extreme misclassification case $C_{ij} = (1 - \delta_{ij})F$,
for $F \in \mathbb{N}$ (this holds for N > 2, while it is not true anymore
for N = 2, see Subsec. 2.1).

Let $X, Y \in \mathcal{M}(S \times N, \mathbb{F}_2)$ be two matrices where
$X_{sn} = 1$ if the sample s is predicted to be of class n (pc(s) = n) and
$X_{sn} = 0$ otherwise, and $Y_{sn} = 1$ if sample s belongs to class n
(tc(s) = n) and $Y_{sn} = 0$ otherwise. Using Kronecker's delta function,
the definition becomes:

$$X = \left(\delta_{pc(s),n}\right)_{sn}, \qquad Y = \left(\delta_{tc(s),n}\right)_{sn} .$$

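Then the Matthews Correlation Coefficient MCC can be defined as the ratio:

$$\mathrm{MCC} = \frac{\mathrm{cov}(X, Y)}{\sqrt{\mathrm{cov}(X, X) \cdot \mathrm{cov}(Y, Y)}} ,$$

where $\mathrm{cov}(\cdot,\cdot)$ is the covariance function. In terms of
the confusion matrix, the above equation can be written as:

$$\mathrm{MCC} = \frac{\sum_{k,l,m=1}^{N} \left(C_{kk} C_{ml} - C_{lk} C_{km}\right)}
{\sqrt{\sum_{k=1}^{N} \left[\left(\sum_{l=1}^{N} C_{lk}\right)\left(\sum_{f,g=1,\, f \neq k}^{N} C_{gf}\right)\right]}\;
 \sqrt{\sum_{k=1}^{N} \left[\left(\sum_{l=1}^{N} C_{kl}\right)\left(\sum_{f,g=1,\, f \neq k}^{N} C_{fg}\right)\right]}} \qquad (2)$$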



The most natural performance measure is the accuracy, defined as the ratio
of the correctly classified samples over all the samples:

$$\mathrm{ACC} = \frac{\sum_{k=1}^{N} C_{kk}}{\sum_{i,j=1}^{N} C_{ij}} .$$



In information theory, the entropy H associated to a random variable X is
the expected value of the self-information I of X:

$$H(X) = E(I(X)) = \sum_{x} p(x) I(x) = -\sum_{x} p(x) \log_b(p(x)) ,$$

where p(x) is the probability mass function of X, with the convention
$p(x) \log_b(p(x)) = 0$ for $p(x) = 0$, motivated by the limit
$\lim_{x \to 0} x \log(x) = 0$.
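The entropy definition and its zero-probability convention can be sketched in a few lines. The function name `entropy` and the example distributions are illustrative assumptions, not part of the original text.

```python
import math

def entropy(pmf, base=2.0):
    """Shannon entropy of a probability mass function given as a list of probabilities."""
    # Terms with p == 0 are skipped, which implements the convention 0 * log(0) = 0.
    return sum(-p * math.log(p, base) for p in pmf if p > 0)

print(entropy([0.5, 0.5]))    # fair coin
print(entropy([1.0, 0.0]))    # deterministic outcome, relies on the convention
print(entropy([0.25] * 4))    # uniform distribution over four outcomes
```

Changing `base` only rescales the result; the Confusion Entropy uses base 2N - 2 so that its value is normalized with respect to the number of classes.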

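The Confusion Entropy measure CEN for a confusion matrix C is defined in
(Wei et al., 2010a) as:

$$\mathrm{CEN} = \sum_{j=1}^{N} P_j \,\mathrm{CEN}_j
 = -\sum_{j=1}^{N} P_j \sum_{k=1,\, k \neq j}^{N}
 \left( P^{j}_{jk} \log_{2N-2} P^{j}_{jk} + P^{j}_{kj} \log_{2N-2} P^{j}_{kj} \right) \qquad (1)$$

where the misclassification probabilities P are defined as the following
ratios:

$$P_j = \frac{\sum_{k=1}^{N} \left(C_{jk} + C_{kj}\right)}{2 \sum_{k,l=1}^{N} C_{kl}}, \qquad
P^{i}_{ij} = \frac{C_{ij}}{\sum_{k=1}^{N} \left(C_{ik} + C_{ki}\right)}, \qquad
P^{j}_{ij} = \frac{C_{ij}}{\sum_{k=1}^{N} \left(C_{jk} + C_{kj}\right)} \quad (i \neq j), \qquad
P^{i}_{ii} = 0 .$$
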
MCC lives in the range [-1, 1], where 1 is perfect classification, -1 is
reached in the alternative extreme misclassification case of a confusion
matrix with all zeros but in two symmetric entries $C_{ij}$, $C_{ji}$, and
MCC = 0 when the confusion matrix is all zeros but for one single column
(all samples have been classified to be of a class k), or when all entries
are equal, $C_{ij} = K \in \mathbb{N}$. In this last case, the Confusion
Entropy value is $\left(1 - \frac{1}{N}\right) \log_{2N-2}(2N)$; when only
a single column is not zero, the Confusion Entropy can assume many
different values, depending on this column's entries. Note that both
measures are invariant for scalar multiplication of the whole confusion
matrix.
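The extreme cases listed above can be checked numerically. The sketch below assumes Gorodkin's confusion-matrix form of the multi-class MCC and the Confusion Entropy of Wei et al. with logarithms in base 2N - 2; the function names, the example matrices, and the choice of returning 0 when the MCC denominator vanishes are assumptions of this sketch.

```python
import math

def mcc(C):
    """Multi-class MCC from a square confusion matrix (rows: true class, columns: predicted)."""
    n = len(C)
    s = sum(map(sum, C))                                      # total number of samples
    c = sum(C[k][k] for k in range(n))                        # correctly classified samples
    t = [sum(C[k]) for k in range(n)]                         # row sums (true class sizes)
    p = [sum(C[l][k] for l in range(n)) for k in range(n)]    # column sums (predicted sizes)
    num = c * s - sum(pk * tk for pk, tk in zip(p, t))
    den = math.sqrt((s * s - sum(x * x for x in p)) * (s * s - sum(x * x for x in t)))
    return num / den if den > 0 else 0.0   # assumed convention when the denominator is zero

def cen(C):
    """Confusion Entropy with logarithms in base 2N - 2."""
    n = len(C)
    base = 2 * n - 2
    total = sum(map(sum, C))
    value = 0.0
    for j in range(n):
        d = sum(C[j][k] + C[k][j] for k in range(n))          # row j plus column j
        for k in range(n):
            if k == j:
                continue
            for entry in (C[j][k], C[k][j]):
                if entry > 0:                                  # convention 0 * log(0) = 0
                    value -= entry / (2 * total) * math.log(entry / d, base)
    return value

examples = {
    "perfect": [[5, 0, 0], [0, 5, 0], [0, 0, 5]],
    "two symmetric entries": [[0, 3, 0], [3, 0, 0], [0, 0, 0]],
    "single column": [[2, 0, 0], [4, 0, 0], [3, 0, 0]],
    "all entries equal": [[2, 2, 2], [2, 2, 2], [2, 2, 2]],
    "zero diagonal": [[0, 4, 4], [4, 0, 4], [4, 4, 0]],
}
for name, C in examples.items():
    print(f"{name:22s} MCC = {mcc(C):+.4f}  CEN = {cen(C):.4f}")
```

Multiplying every entry of a matrix by the same positive constant leaves both printed values unchanged, which is the scalar invariance noted in the text.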

CEN is indeed more discriminant than MCC in some situations, for instance
when MCC = 0 as mentioned above, or when the number of samples is
relatively small and thus it is more likely to have different confusion
matrices with the same MCC and different CEN. This can be quantitatively
assessed by using the degree of discriminancy introduced in (Huang & Ling,
2005): for two measures f and g on a domain $\Psi$, let
$P = \{(a,b) \in \Psi \times \Psi : f(a) > f(b),\ g(a) = g(b)\}$ and
$Q = \{(a,b) \in \Psi \times \Psi : f(a) = f(b),\ g(a) > g(b)\}$; then the
degree of discriminancy for f over g is $|P|/|Q|$. For instance, in the
3-classes case with 2, 4, 3 samples respectively, the degree of
discriminancy of CEN over MCC is about 6. A similar behaviour happens for
all the 12 small sample size cases on three classes listed in (Wei et al.,
2010a, Tab. 6), ranging from 9 to 19 samples. In the same paper (Huang &
Ling, 2005), another indicator for comparing measures is defined, the
degree of consistency: for two measures f and g on a domain $\Psi$, let
$R = \{(a,b) \in \Psi \times \Psi : f(a) > f(b),\ g(a) > g(b)\}$ and
$S = \{(a,b) \in \Psi \times \Psi : f(a) > f(b),\ g(a) < g(b)\}$; then the
degree of consistency of f and g is $|R|/(|R| + |S|)$.
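The two indicators can be computed by direct enumeration of ordered pairs. The following is a sketch under stated assumptions: the measures are supplied as two lists of precomputed values over the same objects, the function names are hypothetical, and `None` is returned when a ratio is undefined.

```python
from itertools import permutations

def degree_of_discriminancy(f_vals, g_vals):
    """|P| / |Q| for f over g, where f_vals[i] and g_vals[i] refer to the same object."""
    pairs = list(permutations(range(len(f_vals)), 2))   # ordered pairs (a, b), a != b
    p = sum(1 for a, b in pairs if f_vals[a] > f_vals[b] and g_vals[a] == g_vals[b])
    q = sum(1 for a, b in pairs if f_vals[a] == f_vals[b] and g_vals[a] > g_vals[b])
    return p / q if q else None                          # undefined when Q is empty

def degree_of_consistency(f_vals, g_vals):
    """|R| / (|R| + |S|): share of strictly ordered pairs on which f and g agree."""
    pairs = list(permutations(range(len(f_vals)), 2))
    r = sum(1 for a, b in pairs if f_vals[a] > f_vals[b] and g_vals[a] > g_vals[b])
    s = sum(1 for a, b in pairs if f_vals[a] > f_vals[b] and g_vals[a] < g_vals[b])
    return r / (r + s) if r + s else None

# Hypothetical values of two measures on five objects.
f_vals = [0.1, 0.2, 0.3, 0.3, 0.5]
g_vals = [1, 1, 2, 3, 0]
print(degree_of_discriminancy(f_vals, g_vals))
print(degree_of_consistency(f_vals, g_vals))
```

With floating-point measures such as CEN and MCC, exact equality is fragile, so a practical implementation would likely compare values up to a small tolerance when detecting ties.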

A quite different behaviour between the two measures can be highlighted in
the following situation: consider the matrix $Z_A$ whose entries are all
equal but for a non-diagonal one; because of the multiplicative invariance,
we can set all entries to one but for the one in the leftmost lower corner:
$(Z_A)_{ij} = 1 + \delta_{(i,j),(N,1)}(A - 1)$ for A > 1 a positive
integer. When A grows bigger, more and more samples are misclassified: for
instance, the corresponding accuracy reads
$\mathrm{ACC}(Z_A) = N/(N^2 + A - 1)$, thus decreasing towards zero for
increasing A.

The MCC measure of this confusion matrix is

$$\mathrm{MCC}(Z_A) = -\frac{A - 1}{(N - 1)(N^2 + 2A - 2)} ,$$

which is a function monotonically decreasing for increasing values of A,
with limit $-1/(2(N - 1))$ for $A \to \infty$. On the other hand, the
Confusion Entropy for the same family of matrices is



$$\mathrm{CEN}(Z_A) = \frac{1}{N^2 + A - 1} \Big[ (N - 2)(N - 1) \log_{2N-2}(2N)
 + (2N + A - 3) \log_{2N-2}(2N + A - 1) - A \log_{2N-2}(A) \Big] ,$$

which is a decreasing function of increasing A, asymptotically moving
towards zero, i.e., the minimal entropy case. Thus in this case, the
behaviour of the Confusion Entropy is opposite to that of more classical
measures such as MCC and accuracy.
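The behaviour of the three measures on this family can be reproduced numerically. The sketch below assumes the matrix with all entries equal to one except the lower-left entry A, Gorodkin's form of MCC, and the closed forms MCC = -(A - 1)/((N - 1)(N^2 + 2A - 2)) and CEN = [(N - 2)(N - 1) log(2N) + (2N + A - 3) log(2N + A - 1) - A log(A)]/(N^2 + A - 1) with logarithms in base 2N - 2; all function names are hypothetical.

```python
import math

def acc(C):
    return sum(C[k][k] for k in range(len(C))) / sum(map(sum, C))

def mcc(C):
    n = len(C)
    s = sum(map(sum, C))
    c = sum(C[k][k] for k in range(n))
    t = [sum(C[k]) for k in range(n)]                         # row sums
    p = [sum(C[l][k] for l in range(n)) for k in range(n)]    # column sums
    num = c * s - sum(pk * tk for pk, tk in zip(p, t))
    den = math.sqrt((s * s - sum(x * x for x in p)) * (s * s - sum(x * x for x in t)))
    return num / den if den > 0 else 0.0

def cen(C):
    n = len(C)
    total = sum(map(sum, C))
    value = 0.0
    for j in range(n):
        d = sum(C[j][k] + C[k][j] for k in range(n))
        for k in range(n):
            if k != j:
                for entry in (C[j][k], C[k][j]):
                    if entry > 0:
                        value -= entry / (2 * total) * math.log(entry / d, 2 * n - 2)
    return value

def z_matrix(n, a):
    """All-ones n x n matrix with the lower-left entry replaced by a."""
    Z = [[1] * n for _ in range(n)]
    Z[n - 1][0] = a
    return Z

def mcc_closed(n, a):
    return -(a - 1) / ((n - 1) * (n * n + 2 * a - 2))

def cen_closed(n, a):
    log = lambda x: math.log(x, 2 * n - 2)
    return ((n - 2) * (n - 1) * log(2 * n) + (2 * n + a - 3) * log(2 * n + a - 1)
            - a * log(a)) / (n * n + a - 1)

N = 4
for A in (1, 2, 10, 100, 1000):
    Z = z_matrix(N, A)
    print(f"A={A:5d}  ACC={acc(Z):.4f}  MCC={mcc(Z):+.4f}  CEN={cen(Z):.4f}")
```

All three columns decrease as A grows: for accuracy and MCC this signals a worse classifier, whereas for CEN a lower value nominally signals a better one.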

Analogous considerations hold for the case of (perfectly) random
classification on an unbalanced problem: because of the multiplicative
invariance of the measures, we can assume that the confusion matrix for
this case has all entries equal to one but for the last row, whose entries
are all A, for A > 1. In this case, the Confusion Entropy is

$$\mathrm{CEN} = \frac{N - 1}{2N(N + A - 1)} \Big[ (2N + A - 3) \log_{2N-2}(2N + A - 1)
 - 2A \log_{2N-2}(A) + (A + 1) \log_{2N-2}(N + NA + A - 1) \Big] ,$$

which is a decreasing function for growing A whose limit for
$A \to \infty$ is $\frac{N-1}{2N} \log_{2N-2}(N + 1)$ (as a function of N,
this limit is an increasing function asymptotically growing towards 1/2).

One of the main features of the MCC measure is the fact that MCC = 0
identifies all those cases where random classification (i.e., no learning)
happens: this is lost in the case of CEN, due to its greater discriminant
power, since there is no unique value associated to the wide spectrum of
random classification.

Consider now the confusion matrix B of dimension N where
$B_{ij} = F + (T - F)\delta_{ij}$, i.e. all entries have value F but in the
diagonal whose values are all T, for T, F two integers. In this case,

$$\mathrm{MCC} = \frac{T^2 + (N - 2)TF - (N - 1)F^2}{[T + (N - 1)F]^2}$$

$$\mathrm{CEN} = \frac{(N - 1)F}{T + (N - 1)F} \log_{2N-2} \frac{2[T + (N - 1)F]}{F}$$

and thus

$$\mathrm{CEN} = (1 - \mathrm{MCC}) \left(1 + \log_{2N-2} \frac{T + (N - 1)F}{(N - 1)F}\right) \left(1 - \frac{1}{N}\right) .$$
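The identity for this family of matrices can be verified numerically, using the fact that here the argument of the logarithm equals 1/(1 - ACC). The sketch assumes Gorodkin's form of MCC and base 2N - 2 logarithms for CEN; the helper `b_matrix` and the test values of N, T, F are hypothetical.

```python
import math

def acc(C):
    return sum(C[k][k] for k in range(len(C))) / sum(map(sum, C))

def mcc(C):
    n = len(C)
    s = sum(map(sum, C))
    c = sum(C[k][k] for k in range(n))
    t = [sum(C[k]) for k in range(n)]                         # row sums
    p = [sum(C[l][k] for l in range(n)) for k in range(n)]    # column sums
    num = c * s - sum(pk * tk for pk, tk in zip(p, t))
    den = math.sqrt((s * s - sum(x * x for x in p)) * (s * s - sum(x * x for x in t)))
    return num / den if den > 0 else 0.0

def cen(C):
    n = len(C)
    total = sum(map(sum, C))
    value = 0.0
    for j in range(n):
        d = sum(C[j][k] + C[k][j] for k in range(n))
        for k in range(n):
            if k != j:
                for entry in (C[j][k], C[k][j]):
                    if entry > 0:
                        value -= entry / (2 * total) * math.log(entry / d, 2 * n - 2)
    return value

def b_matrix(n, t, f):
    """Matrix with t on the diagonal and f everywhere else."""
    return [[t if i == j else f for j in range(n)] for i in range(n)]

def identity_rhs(C):
    """(1 - MCC) * (1 - log_{2N-2}(1 - ACC)) * (1 - 1/N)."""
    n = len(C)
    return (1 - mcc(C)) * (1 - math.log(1 - acc(C), 2 * n - 2)) * (1 - 1 / n)

for n, t, f in [(3, 10, 1), (5, 7, 2), (8, 1, 4)]:
    B = b_matrix(n, t, f)
    print(n, t, f, round(cen(B), 6), round(identity_rhs(B), 6))
```

For each triple the two printed values coincide up to rounding, and MCC reduces to (T - F)/(T + (N - 1)F) after simplifying the fraction given in the text.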

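This identity can be relaxed to the following generalization, which is a
slight underestimate of the true CEN value:

$$\mathrm{CEN} \simeq \frac{1}{k} \cdot (1 - \mathrm{MCC})
 \left(1 + \log_{2N-2} \frac{\sum_{i,j} C_{ij}}{\sum_{i \neq j} C_{ij}}\right) \left(1 - \frac{1}{N}\right)
 = \frac{1}{k} \cdot (1 - \mathrm{MCC}) \left(1 - \log_{2N-2}(1 - \mathrm{ACC})\right) \left(1 - \frac{1}{N}\right) \qquad (3)$$

where both sides are zero when MCC = ACC = 1, and
$k = 1.012 \cdot (1 + [\ldots])$. For simplicity's sake, we call the right
member of Eq. (3) transformed MCC, tMCC for short.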

To show that the relation in Eq. (3) is valid in a wide range of
situations, an experiment has been performed, whose result is graphically
reported in Fig. 1. In detail, 200,000 confusion matrices in dimensions
ranging from 3 to 30 have been generated with the following setup: the
number of correctly classified elements (i.e., the diagonal elements) for
each class has been (uniformly) randomly chosen between 1 and 1000, while
each non-diagonal entry has been chosen as a random integer between 1 and
$\lfloor 1000 \rho_i \rfloor$, where the ratio $\rho_i$ for the i-th matrix
$M_i$ was extracted from the uniform distribution in the range [0.01, 1].
The correlation between tMCC and $k \cdot$CEN is 0.9941477 and the degree
of consistency is $1 - 10^{-7}$ (the degree of discriminancy is undefined
since no ties occurred). In particular, the average ratio between tMCC and
$k \cdot$CEN is 1.000508, with 95% bootstrap Student confidence interval
(1.000328, 1.000711).
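A scaled-down version of this simulation can be sketched as follows. Several points are assumptions of the sketch rather than the authors' setup: only 200 matrices are drawn, the matrix dimension is sampled uniformly between 3 and 30, the seed is arbitrary, and the factor k is left out, so the quantity compared with CEN is the unscaled product (1 - MCC)(1 - log(1 - ACC))(1 - 1/N). The printed correlation is therefore not expected to reproduce the figures reported in the text.

```python
import math
import random

def acc(C):
    return sum(C[k][k] for k in range(len(C))) / sum(map(sum, C))

def mcc(C):
    n = len(C)
    s = sum(map(sum, C))
    c = sum(C[k][k] for k in range(n))
    t = [sum(C[k]) for k in range(n)]                         # row sums
    p = [sum(C[l][k] for l in range(n)) for k in range(n)]    # column sums
    num = c * s - sum(pk * tk for pk, tk in zip(p, t))
    den = math.sqrt((s * s - sum(x * x for x in p)) * (s * s - sum(x * x for x in t)))
    return num / den if den > 0 else 0.0

def cen(C):
    n = len(C)
    total = sum(map(sum, C))
    value = 0.0
    for j in range(n):
        d = sum(C[j][k] + C[k][j] for k in range(n))
        for k in range(n):
            if k != j:
                for entry in (C[j][k], C[k][j]):
                    if entry > 0:
                        value -= entry / (2 * total) * math.log(entry / d, 2 * n - 2)
    return value

def random_confusion_matrix(rng, n):
    """Diagonal entries in [1, 1000]; off-diagonal entries in [1, floor(1000 * rho)]."""
    rho = rng.uniform(0.01, 1.0)
    upper = max(1, math.floor(1000 * rho))
    return [[rng.randint(1, 1000) if i == j else rng.randint(1, upper)
             for j in range(n)] for i in range(n)]

def tmcc_unscaled(C):
    """(1 - MCC)(1 - log_{2N-2}(1 - ACC))(1 - 1/N), without the factor k."""
    n = len(C)
    return (1 - mcc(C)) * (1 - math.log(1 - acc(C), 2 * n - 2)) * (1 - 1 / n)

def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

rng = random.Random(0)   # arbitrary seed, for reproducibility of the sketch only
matrices = [random_confusion_matrix(rng, rng.randint(3, 30)) for _ in range(200)]
x = [tmcc_unscaled(C) for C in matrices]
y = [cen(C) for C in matrices]
print("Pearson correlation between unscaled tMCC and CEN:", pearson(x, y))
```

Because every off-diagonal entry is at least 1, the accuracy is strictly below 1 and the logarithm in `tmcc_unscaled` is always defined.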

2.1. The binary case 

In the binary case of two classes positive (P) and negative (N), the
confusion matrix becomes
$\begin{pmatrix} TP & FN \\ FP & TN \end{pmatrix}$, where T and F stand for
true and false respectively.

In this setup, the Matthews correlation coefficient has the following
shape:

$$\mathrm{MCC} = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}} .$$

Similarly, the Confusion Entropy can be written as:

$$\mathrm{CEN} = \frac{(FN + FP) \log_2\left((TP + TN + FP + FN)^2 - (TP - TN)^2\right)}{2(TP + TN + FP + FN)}
 - \frac{FN \log_2 FN + FP \log_2 FP}{TP + TN + FP + FN} .$$

Note that in the case TP = TN = T and FP = FN = F, the Confusion Entropy
reads

$$\mathrm{CEN} = \frac{F}{T + F} \log_2 \frac{2(T + F)}{F} ,$$

which is bigger than 1 when the ratio T/F is smaller than 1. This means
that all the confusion matrices
$\begin{pmatrix} T & F \\ F & T \end{pmatrix}$ with 0 < T < F have a
confusion entropy larger than 1, the value attained for the totally
misclassified case T = 0. Such behaviour makes CEN unusable as a classifier
performance measure in the binary case.
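The anomaly in the binary case is easy to reproduce. The sketch below assumes the Confusion Entropy of Wei et al. (base 2 logarithms for N = 2) and the symmetric closed form CEN = F/(T + F) log2(2(T + F)/F); the function names and the value F = 10 are illustrative.

```python
import math

def cen(C):
    """Confusion Entropy with logarithms in base 2N - 2 (base 2 in the binary case)."""
    n = len(C)
    total = sum(map(sum, C))
    value = 0.0
    for j in range(n):
        d = sum(C[j][k] + C[k][j] for k in range(n))          # row j plus column j
        for k in range(n):
            if k != j:
                for entry in (C[j][k], C[k][j]):
                    if entry > 0:                              # convention 0 * log(0) = 0
                        value -= entry / (2 * total) * math.log(entry / d, 2 * n - 2)
    return value

def cen_symmetric_binary(t, f):
    """Closed form for the matrix [[t, f], [f, t]] with f > 0."""
    return f / (t + f) * math.log2(2 * (t + f) / f)

F = 10
for T in range(0, 2 * F + 1, 2):
    C = [[T, F], [F, T]]
    print(f"T={T:2d}  F={F}  CEN={cen(C):.4f}  closed form={cen_symmetric_binary(T, F):.4f}")
```

The printed values rise above 1 for 0 < T < F, so a classifier with some correct predictions receives a worse score than the totally misclassified case T = 0.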

3. Conclusions 

Accuracy, Matthews Correlation Coefficient and Confusion Entropy are three
crucial performance measures for evaluating the outcome of a classification
task, both on binary and multi-class problems (the fourth one is Area Under
the Curve, whenever a ROC curve can be drawn). Although they show a
mutually consistent behaviour, each of them is better tailored to deal with
different situations.

Accuracy is by far the simplest one, and its role is to convey a first
rough estimate of the classifier goodness. Its use is widespread in the
scientific literature, but it suffers from several caveats, the most
relevant being the inability to cope with unbalanced classes and thus the
impossibility of distinguishing among different kinds of
misclassifications.

Confusion Entropy, on the other hand, is probably the finest measure and it
shows an extremely high level of discriminancy even between very similar
confusion matrices. However, this feature is not always welcome, because it
makes the interpretation of its value considerably harder, especially when
considering situations that are naturally very similar (e.g., all the cases
with MCC = 0). Moreover, CEN may show erratic behaviour in the binary case.

In this spirit, the Matthews Correlation Coefficient is a good compromise
between reaching a reasonable discriminancy degree among different cases,
and the need for the practitioner of an easily interpretable value
expressing the type of misclassification associated to the chosen
classifier on the given task. We showed here that there is a strong linear
relation between CEN and a logarithmic function of MCC regardless of the
dimension of the considered problem. Furthermore, MCC behaviour is totally
consistent also for the binary case.

Given this, we can suggest MCC as the best off-the-shelf evaluation tool
for general purpose tasks, while more subtle measures such as CEN should be
reserved for specific topics where more refined discrimination is crucial.